Abstract

We present a real-time object-based SLAM system that leverages the largest object database to date. Our approach comprises 
two main components: 1) a monocular SLAM algorithm that exploits object rigidity constraints to improve the map and find its real 
scale, and 2) a novel object recognition algorithm based on bags of binary words, which provides live detections with a database 
of 500 3D objects. The two components work together and benefit each other: the SLAM algorithm accumulates information from 
the observations of the objects, anchors object features to special map landmarks and sets constraints on the optimization. At the
same time, objects partially or fully located within the map are used as a prior to guide the recognition algorithm, achieving higher
recall. We evaluate our proposal in five real environments, showing improvements in the accuracy of the map and in efficiency with
respect to other state-of-the-art techniques.

1. Introduction

A robot that moves and operates in an environment needs to
acquire live information about it in real time. This information
can be obtained from visual SLAM (simultaneous localization
and mapping), a key component of many systems that allows
mobile robots to create maps of their surroundings as they
explore them, and to keep track of their own location.
The computed maps provide rich geometrical information that is
useful for reliable camera localization, but they describe the
observed scene poorly. Recently, these maps have been augmented
with objects to allow robots to interact with the scene [1][2][3].

To include objects in SLAM maps, these must be recognized
in the images acquired by the robot by computing a rigid-body
3D transformation. A vast line of research has provided solutions
to this problem [4][5][6][7], but it has remained separate from
visual SLAM.

Our aim in this paper is to approach object recognition and
monocular object SLAM together, with a novel solution based
on accumulating information over time to compute more robust
poses of objects and to keep them constantly located in
the scene. To achieve this, we propose a novel object recognition
algorithm that provides detections of objects while a
keyframe-based visual SLAM algorithm builds a map.

Once an object is observed several times from different camera
positions, those object features with several observations are
triangulated within the map as anchor points. Anchor points
provide the location of the object within the map and set
additional geometrical constraints in the bundle adjustment (BA)
optimization. Since object models are at real scale, anchor
points provide observations of the map scale.

Standard BA optimizes camera poses and map point locations,
and it is well known that it can only recover maps up to
scale. In contrast, our algorithm optimizes the camera poses,
the map points, the anchor points, the objects and the scale,
and as a result we obtain maps at real scale composed of objects.

Our system relies on an object recognition algorithm that
works on a single-image basis but takes advantage of the video
sequence. It exploits the information collected by SLAM to
treat previous observations as cues for the location of the
objects in the current image. This yields faster and more
repeatable detections that, in turn, provide more geometrical
constraints to SLAM.

The novel object recognition algorithm we propose, based on
bags of binary words [8], uses a static visual vocabulary that is
independent of the number of objects, and models the entire
appearance of the objects with ORB (oriented FAST and rotated
BRIEF) features [?]. Poses of objects are found from 2D-to-3D
correspondences that are refined by guided matching during a
RANSAC-like step [10]. Our system performs fast and reliable
recognition of 3D objects with databases comprising up to
500 objects, while meeting the real-time constraints of a SLAM
system.

Our work makes the following contributions: 

1. We present a complete visual SLAM system that is able to
insert real objects in the map and to refine their 3D pose
over time by re-observation, with a single monocular camera.

2. We show the feasibility of storing hundreds of comprehensive
3D models in a single object database, composed of
bags of binary words with direct and inverted indices. We
also propose a novel technique to sample putative
correspondences in the verification stage.

3. We propose a new SLAM back-end that includes the geometrical
information provided by the objects into the map
optimization to improve the accuracy of the map, the objects
and their relative scale at each step.

4. We present results on real and independent datasets and
comparisons with other systems. Our results prove that, by
including objects, our monocular system can retrieve the
real scale of the scene, and obtains more accurate results
than PTAM [?] and RGB-D SLAM [?], while keeping
real-time performance (tracking takes 7.6 ms, and recognition,
around 200 ms per image). Our results also demonstrate
that the system is extremely robust against occasional
wrong detections, avoiding map corruption.

The paper is organized as follows: Section 2 presents the
related work on object SLAM and object recognition. Section 3
gives an overview of our complete system. Section 4 details the
visual SLAM approach and the object insertion, and Section 5
the object recognition algorithm. Section 6 shows the experimental
evaluation of our system, and Section 7 concludes the
paper.

2. Related work 

Object-augmented mapping has been previously approached
by SLAM methods based on the extended Kalman filter [2][?].
However, current state-of-the-art monocular SLAM methods
are based on keyframes, and build their maps from only a few
selected video frames. As Strasdat et al. [?] showed, these
systems produce better results than filter-based methods
because they handle many more points and produce larger,
accurate sparse point maps in real time at frame rate.

The work by Castle et al. [?] was one of the first
to merge object recognition and monocular keyframe-based
SLAM. After detecting an object in two frames, they compute
its pose in the map. These objects are shown as augmented
reality but, unlike our approach, they do not add the objects to
the optimization. They built a database of 37 planar pictures
described by SIFT features. In contrast to this system, which is
restricted to planar objects, we can deal with objects with
arbitrary 3D shapes.

Bao et al. [?] were the first to present Semantic Structure
from Motion (SSfM), which is a framework to jointly optimize
cameras, points and objects. SLAM methods deal with
the fact that the information comes from a video stream,
thus the graph of points and keyframes is incremental, while
SSfM processes all the frames at once. Moreover, the recognition
and reconstruction steps are separate and independent in [?].
In our algorithm, in contrast, recognition and reconstruction take
place at the same time, since SLAM and object detection are
fully integrated. The recognition method in [?] retrieves a
bounding box of the object, while our object detector retrieves a
6DoF pose.

Along the same line, Fioraio et al. [?] presented a SLAM
system that adds 3D objects to the maps when they are recognized
with enough confidence, optimizing their pose together
with the map by bundle adjustment. They build a database of
7 objects that are described by 3D features acquired
at several scales with an RGB-D camera, creating an independent
index for each scale. The recognition is performed by
finding 3D-to-3D putative correspondences that are filtered by
a RANSAC-based algorithm. Although they are able to build
room-sized maps with a few objects, their system does not run
in real time. In comparison, our system improves scalability
and execution time by using binary 2D features and a single
index structure that can deal with all the keypoint scales at the
same time.

Salas-Moreno et al. [?] presented one of the most recent visual
SLAM systems that combines RGB-D map building and
object recognition. They represent the map with a graph in
which nodes store the positions of cameras or objects, and refine
the poses of all of them when the overall graph is optimized. A
database of objects is built beforehand with KinectFusion [?],
describing their geometry with point pair features [?]. These
are indexed by a hash table, and the recognition is performed by
computing a large number of candidate rigid-body transformations
that cast votes in a Hough space. Hough voting is a popular
technique for object detection with RGB-D data [20][21][22],
but its scalability to hundreds of objects is not clear. In fact,
Salas-Moreno et al. achieve real-time execution by exploiting
GPU computation, but they show results with just 4 objects.
In our work, we show results at high frequency with up to 500
objects, computed on a CPU from a monocular camera.

Regarding object recognition, our proposal follows the line
of research that finds matches of local features between
an image and an object model. Sivic & Zisserman [23]
presented a visual vocabulary to match 2D images in large
collections. They proposed to cluster the descriptor space of
image features with k-means to quantize features and represent
images with numerical vectors, denoted as bags of words,
enabling quick comparisons. On the other hand, Lowe [24]
popularized an approach based on directly matching SIFT features
between query and model 2D images. Matching features
requires computing the descriptor distance between large sets of
features, which can be very time-consuming. To speed up this
process, he proposed the best-bin-first technique to find
approximate neighbors with a k-d tree. Both visual vocabularies and
k-d trees were later generalized for matching large sets of
images in real time. Nister & Stewenius [25] presented a
hierarchical visual vocabulary tree built on MSER (maximally stable
extremal regions) [26] and SIFT features, which yielded
fast detections with a dataset of 40000 images. Muja & Lowe
[27] presented a method to automatically configure a set of k-d
trees to best fit the feature data to match.

To fully recover the pose of an object from a single image, it
is necessary to incorporate 3D information into the models. Gordon
& Lowe [?] started to create 3D point cloud models, recovering
the object structure by applying structure from motion
techniques. The pose could then be retrieved by solving the
perspective-n-point problem [?] from 2D-to-3D correspondences.
This has been the basis of many recent object recognition
approaches [5][29][7][6][30]. For example, Collet et al. [?] build
3D models for 79 objects and use the training images of the
objects to build a set of k-d trees to index their SIFT features and
do direct matching. To enhance the detection of small objects
and avoid the background, they run the recognition on small
sets of SIFT features that are close in the query image, merging
the detections later if the object poses overlap. Although we
also divide the query features into regions, we merge the 2D-to-3D
correspondences before computing any pose. This avoids
missing detections in the case of oversegmented regions
with few correspondences. In a similar way, Pangercic et al.
[?] create a database of 50 3D objects represented with a SIFT
vocabulary tree, trained with the same object images. They rely
on an RGB-D camera to segment out the background.

Figure 1: System overview: Every video frame is processed by the SLAM tracking thread to locate the camera, and to determine if a new keyframe is added to
the map. Object recognition is applied to as many frames as possible, exploiting the information of the location of objects previously seen. If the recognition is
successful, the observation of the object is stored until there is enough geometrical information about it. At that moment, the object instance is triangulated and
inserted in the map, together with new map points anchored to object points and a subset of frames that observed them, coined semantic keyframes. This operation
makes it possible to find the map scale and to include object geometrical constraints in the map optimization.

The diverse discretization levels of trees allow feature
correspondences to be computed in several ways. For example,
Hsiao et al. [30] discretize the SIFT descriptor space in a
hierarchical manner to create a 3-level tree. They show the benefits
of computing feature matches at all the levels and not only at
the finest one, obtaining more putative correspondences that
increase the object recognition rate. However, an excess of
correspondences may overburden the pose recovery stage, leading
to a large execution time. In contrast, the approach of Sattler
et al. [29] retrieves correspondences only from those features that
lie in the same visual word, but this may miss correct pairs of
points that do not share the visual word due to discretization
error. In our work, we use a direct index [?] to compute
correspondences between features that lie in the same tree node at a
coarse discretization level. This provides a balanced trade-off
between the number of corresponding points and execution time.

All these works use SIFT or SURF features, which are described
with vectors of 64 or 128 float values, and train matching
trees with the same images with which the objects are modeled,
which forces them to recreate the trees when new objects
are added to the database. Rublee et al. presented ORB features,
which are binary and compact (256-bit descriptors),
and provide a distinctiveness similar to that of SIFT and SURF
[?]. Furthermore, visual vocabularies of binary words that are
created from independent data and do not need to be rebuilt are
suitable to index large collections of images [8]. We show in
this work the viability of a single independent vocabulary of
ORB features to recognize 3D objects with large databases (up
to 500 objects) in real time (around 200 ms/image).

3. System overview 

Our system builds a 3D map composed of camera poses,
points and objects, as illustrated by Figure 1. We make use
of the front-end of the Parallel Tracking and Mapping (PTAM)
algorithm [?] to track the camera motion, and add two new
parallel processes to perform object recognition and object
insertion in the map. Our system also includes a completely
redesigned back-end based on g2o [?] that performs a joint
SLAM optimization of keyframe poses, map points, objects and
map scale.

The SLAM tracking processes all the video frames to compute
the pose of the camera at every time step with an unknown
map scale. When a frame provides distinctive geometrical
information, it is inserted in the map as a keyframe together with
new map points.

Simultaneously, object recognition is performed on as many
frames as possible to search for known objects stored in an
object model database. If information about the location of
objects is available, from both the SLAM map and previous
recognitions, it is exploited to guide the detection in the current
image. A successful detection provides an observation of
an object instance.

Regardless of the recognition algorithm used, a detection
obtained from a single image may be spurious or inaccurate. To
avoid these problems, instead of placing an object in the map
after a first recognition, we insert it in the SLAM map after
accumulating consistent observations over time. The information
given by all the observations is used to triangulate the
object points, and hence the pose of the object inside the SLAM
map. The resulting points are inserted in the 3D map as anchor
points, and the cameras that observed them, as semantic
keyframes. These keyframes are not selected because of a
geometrical criterion, but because they contain relevant semantic
information. The frames of the observations that do not provide
parallax or distinctive geometrical information are discarded.
Each triangulation provides us with an estimate of the map scale,
which we use to globally optimize it.


4. Object-aware SLAM 

4.1. Object insertion within the map 

The recognitions of objects in single images yielded by our
algorithm are used to insert those objects in the SLAM map.
To place them robustly, instead of relying on a single detection,
we accumulate several of them until we have enough geometrical
information to compute a robust 3D pose. This process is
depicted in Figure 2 and explained next.

The object detector described in Section 5 searches for objects
in as many frames as possible, whereas SLAM uses them
to track the camera, so that the pose $T_{WC_i} = [R_{WC_i} \mid t_{WC_i}]$ of
each camera $i$ is known, with a map scale $s$ that is initially
unknown. A successful recognition of an object model $O$ returns a
transformation $T_{C_iO}$ from the camera to the object frame. Since
multiple physical instances of the same object model may exist,
we check to which instance this detection belongs. We do so by
computing a hypothesis of the global pose $T_{WO} = T_{WC_i} T_{C_iO}$
of the detected object in the world, and checking for overlap
with the rest of the objects of the same model that had been
previously observed or are already in the map. Note that this
operation is valid only if we already have an estimate of the
map scale $s$. Otherwise, we assume that consecutive detections
of the same model come from the same real object. After this,
we determine that the detection of object $O$ is an observation of
$O_k$, the $k$-th instance of model $O$. If there is no overlap with any
object observed before, we just create a new instance.

An observation $B^i_{O_k} = \langle T_{WC_i}, T_{C_iO_k}, X_O, U_i \rangle$ yields a set of
correspondences between some 3D points of model $O$, $X_O$, and
2D points of the image taken by camera $C_i$, $U_i$. For each
correspondence $(x_O, u_i) \in (X_O, U_i)$, if the parallax with respect
to the rest of the observations of $x_O$ of the same object instance is
not significant enough, the corresponding pair is discarded. If
an object observation offers neither parallax nor new points, it is
completely disregarded.

Figure 2: Object insertion with a monocular camera. a) Object detection is
performed as fast as possible on the frames of the video stream. b) The bottle
is detected in some frames and its observed 2D points (red points) are
accumulated. c) When several points are observed with enough parallax (yellow
points), their frames are selected as semantic keyframes. Detection frames that
offer no parallax are discarded. d) Observations from semantic keyframes are
used to triangulate the object and its 3D points. The semantic keyframes (red
cameras), the object and its points are inserted in the map, updating its scale.

The observations of an object instance are accumulated until
the following conditions hold: 1) at least 5 different object
points $x_O$ are observed from two different positions, 2) with at
least 3 degrees of parallax between the cameras, and 3) showing
no alignment and a good geometrical conditioning. The points
are triangulated in the world frame ($x_W$) and the pairs $(x_O, x_W)$
are inserted in the map as anchor points. The frames that offered
parallax are also inserted as semantic keyframes.
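
The first two accumulation conditions can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names and argument layout are hypothetical, parallax is measured as the angle between the rays that join the two camera centres with an approximate 3D position of the object (taken, for instance, from a single detection), and the third condition (no alignment and good geometrical conditioning) is not checked.

```python
import math

def parallax_deg(cam_a, cam_b, point):
    """Angle (degrees) between the rays joining two camera centres to a 3D point."""
    ray_a = [p - c for p, c in zip(point, cam_a)]
    ray_b = [p - c for p, c in zip(point, cam_b)]
    dot = sum(a * b for a, b in zip(ray_a, ray_b))
    norm_a = math.sqrt(sum(a * a for a in ray_a))
    norm_b = math.sqrt(sum(b * b for b in ray_b))
    # Clamp to [-1, 1] to protect acos against rounding errors.
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))

def ready_to_triangulate(ids_a, ids_b, cam_a, cam_b, approx_point,
                         min_points=5, min_parallax_deg=3.0):
    """Conditions 1) and 2) of the text for two observations of one instance.

    ids_a, ids_b: identifiers of the object points seen in each observation
    (hypothetical representation); approx_point: rough object position.
    """
    shared = set(ids_a) & set(ids_b)  # object points seen from both positions
    if len(shared) < min_points:
        return False
    return parallax_deg(cam_a, cam_b, approx_point) >= min_parallax_deg
```

With these thresholds, two views that share five object points but are separated by a baseline of 1 cm at 1 m distance (about 0.6 degrees of parallax) would keep accumulating observations instead of triggering a triangulation.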

Anchor points play a decisive role in object SLAM, because
they provide the location of the object within the map
and set additional geometrical constraints in the BA, enabling
the estimation of the map scale. For this reason, anchor points are
treated differently from map points: they are not discarded by
the maintenance algorithm of PTAM, are updated using new object
observations only, and are propagated among the keyframes
of the map by cross-correlation matching in a 3 x 3 pixel
region defined around the projected anchor point in the target
keyframe. The patches for the correlation are extracted from the
semantic keyframes and warped in order to compensate for scale,
rotation and foreshortening by means of a homography.

Figure 3: Object pose prior estimation ($T_{WO_k}$). Red landmarks show the pairs
$(x_O, x_W)$ of object points and anchor points. $T_{WC_i}$ is the pose of the camera,
with map scale $s$ already estimated, and $T_{C_iO_k}$ is the pose of the object with
respect to the camera for the last observation.

By triangulating object points from several observations, we
obtain a more robust 3D pose than by relying on a single detection.
Furthermore, this operation is necessary to find the map
scale $s$ if our only source of data is a monocular camera. Since
we aim for $s\,T_{WC_i}^{-1}\,x_W = T_{C_iO_k}\,x_O$ for each point, an estimate of
the map scale is given by each triangulated object instance:

$$
s_{O_k} = \arg\min_{s} \sum_{i} \sum_{(x_O,\,x_W)} \left\| s\,R_{WC_i}^{\top} x_W - R_{C_iO_k}\,x_O - t_{C_iO_k} - s\,R_{WC_i}^{\top} t_{WC_i} \right\|^2 . \tag{1}
$$
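
A minimal Python sketch of this scale estimate follows. It assumes that each residual has the form $s\,a - b$, with $a = R_{WC_i}^{\top}(x_W - t_{WC_i})$ the anchor point in the camera frame in map units and $b = R_{C_iO_k} x_O + t_{C_iO_k}$ the model point in the camera frame in real units, and that the squared norms are summed. Under that assumption the minimizer has the closed form $s = \sum a^{\top} b / \sum a^{\top} a$. The function name and data layout are hypothetical.

```python
def mat_vec(R, v):
    """Product of a 3x3 matrix (list of rows) and a 3-vector."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]

def transpose(R):
    return [[R[c][r] for c in range(3)] for r in range(3)]

def estimate_scale(observations):
    """Least-squares map scale from the observations of one object instance.

    observations: iterable of (R_wc, t_wc, R_co, t_co, pairs), where pairs is
    a list of (x_o, x_w) anchor correspondences seen by that camera.
    """
    num = 0.0
    den = 0.0
    for R_wc, t_wc, R_co, t_co, pairs in observations:
        R_cw = transpose(R_wc)
        for x_o, x_w in pairs:
            # a: anchor point in the camera frame, in (unscaled) map units.
            a = mat_vec(R_cw, [xw - tw for xw, tw in zip(x_w, t_wc)])
            # b: model point in the camera frame, in real units.
            b = [r + t for r, t in zip(mat_vec(R_co, x_o), t_co)]
            num += sum(ai * bi for ai, bi in zip(a, b))
            den += sum(ai * ai for ai in a)
    return num / den
```

For instance, if every map coordinate is twice its real value, the function returns 0.5, the factor that brings the map to real scale.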

To insert the object in the map, we must compute its pose
$T_{WO_k}$ in the map world frame. After the first triangulation of an
object, we compute an initial pose which is used as a prior in the
subsequent SLAM optimization. The object pose prior is computed
by composing the information provided by SLAM
and the object detector by means of equation (2). This composition
is shown in Figure 3.

$$
T_{WO_k} = \left[\, R_{WC_i} R_{C_iO_k} \mid R_{WC_i}\, t_{C_iO_k} + s\, t_{WC_i} \,\right] . \tag{2}
$$

The pose of the semantic keyframe $T_{WC_i}$ and the pose of the
object with respect to the camera $T_{C_iO_k}$ correspond to the
information of the last observation. The scale $s$ used is either
the scale estimate $s_{O_k}$ computed above, or the map scale $s$ if
we already had a previous estimate that had been refined in the
optimization stage.
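
The composition of equation (2) can be written in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, rotations are 3x3 lists of rows, and the camera translation is assumed to be expressed in unscaled map units, so that it is multiplied by $s$ as in the equation.

```python
def mat_mul(A, B):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def mat_vec(A, v):
    return [sum(A[r][k] * v[k] for k in range(3)) for r in range(3)]

def object_pose_prior(R_wc, t_wc, R_co, t_co, s):
    """Object pose prior [R_wc R_co | R_wc t_co + s t_wc], as in eq. (2)."""
    R_wo = mat_mul(R_wc, R_co)
    t_wo = [rt + s * tw for rt, tw in zip(mat_vec(R_wc, t_co), t_wc)]
    return R_wo, t_wo
```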

The anchor points, the semantic keyframes and the object 
pose priors produced by each triangulation are then included 
in the optimization stage of the SLAM mapping algorithm to 
obtain more accurate values during the SLAM execution. 

4.2. Object SLAM optimization 

In standard keyframe-based SLAM, a sparse map of points
$X_W$ and the camera locations of selected keyframes $T_{WC_i}$ are
estimated by means of a joint bundle adjustment (BA). Figure 4(a)
shows a Bayesian network representing the structure of the
estimation problem.

The BA minimizes the map reprojection error, $e_{ji}$, between
the $j$-th map point observed by the $i$-th keyframe and the
corresponding measurement $u_{ji} = (u_{ji}, v_{ji})^{\top}$:

$$
e_{ji} = \begin{pmatrix} u_{ji} \\ v_{ji} \end{pmatrix} - \mathrm{CamProj}\left( T_{WC_i}^{-1}\, x_{jW} \right) . \tag{3}
$$



Figure 4: SLAM estimation problem. a) Bayesian network of standard SLAM.
$T_{WC_i}$ are the cameras, $x_{jW}$ the map points and $u_{ji}$ is the measurement on the
image. b) Bayesian network of object SLAM. Some objects are added to the
BA, where the object location is represented as $T_{WO_k}$ and the scale $s$ becomes
observable. Highlighted map points are those which belong to the objects.


The point in the camera frame, $x_{jC_i} = T_{WC_i}^{-1}\, x_{jW}$, is projected
onto the image plane through the projection function
$\mathrm{CamProj} : \mathbb{R}^3 \to \mathbb{R}^2$ defined by Devernay & Faugeras [?].

Our goal is to include in the estimation the constraints given
by the triangulation of a set of $K$ objects, and then obtain
optimized estimates of the map cameras, points, objects and scale.
A triangulated object instance $O_k = \langle T_{WO_k}, X_O, X_W \rangle$
comprises a set of 3D points $X_O$ in the object frame and its
corresponding anchored landmarks in the map, with coordinates $X_W$
in the SLAM world frame. It is well known that, from monocular
sequences, the scene scale $s$ is unobservable. However, if
some map points correspond to a scene object of known size (anchor
points), the resulting geometrical constraints make it possible to
estimate the SLAM scale. Figure 4(b) shows the new estimation
problem structure after including anchor points. New nodes are
added to the Bayesian network because new parameters have to
be estimated: the object locations $T_{WO_k}$ and the map scale $s$, which is
now observable and is a single value for all the inserted objects.

Each anchor point $(x_O, x_W) \in (X_O, X_W)$ sets a new constraint,
the object alignment error $a_{jk}$, defined as the difference
between the positions of the point when both measures are
translated into the same object frame of reference:

$$
a_{jk} = x_{jO_k} - s\, R_{WO_k}^{\top}\, x_{jW} + R_{WO_k}^{\top}\, t_{WO_k} . \tag{4}
$$
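
As an illustration, the alignment error can be evaluated as follows. The sketch assumes that the object pose $T_{WO_k} = [R_{WO_k} \mid t_{WO_k}]$ has its translation in real units, as produced by equation (2), so that a scaled map point $s\,x_W$ is brought into the object frame as $R_{WO_k}^{\top}(s\,x_W - t_{WO_k})$. The function name is hypothetical.

```python
def alignment_error(x_o, x_w, R_wo, t_wo, s):
    """Difference between a model point and its anchor point, in the object frame.

    x_o: point in the object frame (real units); x_w: anchor point in the map
    frame (map units); R_wo, t_wo: object pose in the world; s: map scale.
    """
    d = [s * xw - t for xw, t in zip(x_w, t_wo)]
    # Multiply by the transpose of R_wo without building it explicitly.
    x_w_in_o = [sum(R_wo[r][c] * d[r] for r in range(3)) for c in range(3)]
    return [a - b for a, b in zip(x_o, x_w_in_o)]
```

The error vanishes when the anchor point, once scaled, coincides with the model point placed in the world by the object pose; any residual is what the optimization penalizes.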


We propose a BA to iteratively estimate the scale, the map
points, the cameras and the objects, by minimizing a robust
objective function combining the reprojection error (3) and the
object alignment error (4):

$$
\hat{p} = \arg\min_{p} \sum_{i=1}^{N} \sum_{j \in S_i} H\!\left( e_{ji}^{\top}\, \Omega_{e}\, e_{ji} \right) + \sum_{k=1}^{K} \sum_{j \in S_k} H\!\left( a_{jk}^{\top}\, \Omega_{a}\, a_{jk} \right) , \tag{5}
$$

where $N$ is the total number of keyframes, $S_i$ is the subset of
map points seen by the $i$-th camera, $S_k$ is the subset of anchor
points of the $k$-th object instance, and $\Omega_e$ and $\Omega_a$ are the
information matrices of the reprojection error and the alignment error,
respectively. Errors are assumed to be uncorrelated and to follow a
Gaussian distribution, thus the covariance matrices are diagonal.
$\Omega_e$ is a $2 \times 2$ matrix with measurement error
$\sigma_e^2 = 2^{2l}$, where $l$ is the level of the pyramid in which the
feature was extracted. Similarly, $\Omega_a$ is a $3 \times 3$ matrix with
measurement error $\sigma_a^2 = 0.01^2$. $H(\cdot)$ is the Huber robust influence
function [?]:

$$
H(x) = \begin{cases} x & \text{if } |x| < \delta^2 , \\ 2\,\delta\,x^{1/2} - \delta^2 & \text{otherwise.} \end{cases} \tag{6}
$$

Here, $\delta^2$ corresponds to the $\chi^2_{0.05}(r)$ distribution, $r$ being the
number of degrees of freedom. The value for the reprojection
error is $\chi^2_{0.05}(2) = 5.991$ and for the alignment error
$\chi^2_{0.05}(3) = 7.815$.
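
The robust kernel can be sketched in Python as below. The sketch assumes that the kernel is applied to the squared, information-weighted error and that it is linear below the threshold $\delta^2$ and grows with the square root above it; the thresholds are the two values quoted in the text, while the function and constant names are hypothetical.

```python
import math

# Thresholds quoted in the text, indexed by degrees of freedom.
CHI2_THRESHOLD = {2: 5.991, 3: 7.815}

def weighted_sq_error(error, sigma_sq):
    """e^T Omega e for a diagonal information matrix Omega = I / sigma_sq."""
    return sum(c * c for c in error) / sigma_sq

def huber(x, delta_sq):
    """Huber kernel on a squared error x with threshold delta_sq."""
    if abs(x) < delta_sq:
        return x
    delta = math.sqrt(delta_sq)
    return 2.0 * delta * math.sqrt(x) - delta_sq

def robust_reprojection_cost(error_px, pyramid_level):
    """Robust cost of one 2D reprojection error, with sigma^2 = 2^(2 l)."""
    sigma_sq = 2.0 ** (2 * pyramid_level)
    return huber(weighted_sq_error(error_px, sigma_sq), CHI2_THRESHOLD[2])
```

The two branches meet at $x = \delta^2$, so the cost is continuous, and large errors such as those caused by a wrong detection contribute far less than their squared value.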

The optimization vector is

$$
p = \left( s,\ \nu_{WC_2}, \dots, \nu_{WC_N},\ \nu_{WO_1}, \dots, \nu_{WO_K},\ x_{1W}, \dots, x_{MW} \right) , \tag{7}
$$

where $s$ is the map scale and $\nu$ represents a transformation
parametrized as a 6-component vector (rotation and translation)
of the SE(3) Lie group.

While the camera is exploring the scene, new keyframes are
inserted in the map. To compute a prior pose for a new
keyframe, a sliding window is applied to the keyframes of the
map: only the four neighbor keyframes of the new keyframe and
all the visible points are included in a BA that minimizes the
reprojection error of eq. (3). The object pose priors are computed as
previously explained in this section, following eq. (2). Global
BA (5), including all the cameras, points and objects, is performed
every time a new keyframe or a new object is inserted
in the map.

5. 3D Object recognition with large databases 

The object recognition requires a visual vocabulary built
from an independent set of images, and a database of models
that is created offline. Then, the recognition process is executed
online in real time, performing two main steps on a query image
taken at position $T_{WC_i}$: detection of several model candidates
that fit the image features, and verification of the candidates by
computing a rigid-body transformation between the camera and
the objects. The candidates are obtained either by querying all
the models in a database based on bags of words, or by taking
advantage of previously known locations of objects. The verification
step makes use of 2D-to-3D correspondences between image
and object model points to find the object pose in the image,
$T_{C_iO}$. The results are observations $B^i_{O} = \langle T_{WC_i}, T_{C_iO}, X_O, U_i \rangle$
of the recognized object models $O$. The SLAM algorithm then
associates these results with their corresponding object instances
$O_k$, taking the pose of the current camera into account
(Section 4.1).

Figure 5: Objects are modeled with a point cloud obtained from multiple view
geometry.

5.1. Object models

Our object models are composed of a set of 3D points associated
with ORB descriptors and an appearance bag-of-words
representation for the complete object. ORB features are
computationally efficient because they describe image patches with strings
of 256 bits.

Each object model $O$ is created offline from a set of training
images taken from different points of view of the object. We use
the Bundler and PMVS2 software [35][36] to run bundle adjustment
on these images and to obtain a dense 3D point cloud of
the object, as shown in Figure 5. We keep only those points
that consistently appear in at least 3 images. Since objects can
appear at any scale and point of view during recognition, we
associate each 3D point with several ORB descriptors extracted at
different scale levels (up to 2 octaves) and from several training
images.

If the point of view of the training images hardly differs, we
may obtain 3D points with very similar descriptors that add little
distinctiveness. To avoid over-representation, we convert
features into visual words and keep the average descriptor per
3D point and visual word [29]. Finally, an appearance-based
representation of the object is obtained by converting the
surviving binary features of all its views into a bag-of-words vector
with a visual vocabulary. This model provides information
about the whole object surface, so that a single comparison yields a
similarity measurement independently of the viewpoint and the
scale of the object in the query image.

5.2. Object model database 

The object models are indexed in a database composed of
a visual vocabulary, an inverted index and a direct index [?].
The visual vocabulary consists of a tree with binary nodes that
is created by hierarchical clustering of training ORB descriptors.
The leaves of the tree compose the words of the visual
vocabulary. We used 12M descriptors obtained from 30607
independent images from Caltech-256 [?] to build a vocabulary
with k = 32 branches and L = 3 depth levels, which yields 33K
words. When an ORB feature is given, its descriptor vector
traverses the tree from the root to the leaves, selecting at each
level the node which minimizes the Hamming distance, and
obtaining the final leaf as its word. By concatenating the equivalent
words of a set of ORB features, we obtain a bag-of-words vector,
whose entries are weighted with the term frequency - inverse
document frequency (tf-idf) value, and normalized with
the $L_1$-norm. This weight is higher for words with fewer
occurrences in the training images, since they are expected to be
more discriminative.
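
The tree traversal and the construction of the bag-of-words vector can be sketched in Python as follows. This is a toy illustration rather than the actual vocabulary code: descriptors are stored as Python integers, the `Node` class and function names are hypothetical, and the idf weights are assumed to be given for every word. With k = 32 and L = 3 the tree would have 32^3 = 32768 leaves, which is the 33K words mentioned above.

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

class Node:
    def __init__(self, descriptor, children=(), word_id=None):
        self.descriptor = descriptor    # cluster centre of this node
        self.children = list(children)  # empty for leaves
        self.word_id = word_id          # set only on leaves (visual words)

def quantize(root, descriptor):
    """Descend from the root, taking the closest child at each level."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda child: hamming(child.descriptor, descriptor))
    return node.word_id

def bow_vector(root, descriptors, idf):
    """Sparse tf-idf bag-of-words vector, normalized with the L1 norm."""
    counts = Counter(quantize(root, d) for d in descriptors)
    total = sum(counts.values())
    vec = {word: (n / total) * idf[word] for word, n in counts.items()}
    norm = sum(abs(v) for v in vec.values())
    return {word: v / norm for word, v in vec.items()} if norm > 0 else vec
```

Because the vocabulary is fixed, adding a new object only requires quantizing its descriptors with `quantize`; the tree itself does not change.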

The inverted index stores, for each word in the vocabulary,
the objects where it is present, along with its weight in that
object. When a query image is given, this structure provides fast
access to the common words between the query bag-of-words
vector and the model one. The direct index stores, for each
object model, the tree nodes it contains and the associated ORB
features. This is used to select those features that are
likely to match when 2D-to-3D correspondences are required
in the verification stage. We can increase the number of
correspondences if we use the direct index to store nodes at other
tree levels (coarser discretization levels), with little impact on
the execution time [?]. In this work, we store nodes at the first
discretization level of the vocabulary tree.
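
A minimal Python sketch of the two index structures is given below. The class and method names are hypothetical, and the similarity score is an assumption: the text above does not state the score, so the sketch uses the L1 score that is common for L1-normalized bag-of-words vectors, $1 - \frac{1}{2}\|q - m\|_1$, which can be accumulated over the shared words only.

```python
from collections import defaultdict

class ObjectDatabase:
    def __init__(self):
        self.inverted = defaultdict(dict)  # word id -> {object id: weight}
        self.direct = {}                   # object id -> {node id: [feature ids]}

    def add(self, object_id, bow, node_features):
        """Register a model: its bag-of-words vector and its features per node."""
        for word, weight in bow.items():
            self.inverted[word][object_id] = weight
        self.direct[object_id] = {n: list(f) for n, f in node_features.items()}

    def query(self, query_bow):
        """Rank models by L1 score, visiting only words shared with the query."""
        acc = defaultdict(float)
        for word, q in query_bow.items():
            for object_id, m in self.inverted.get(word, {}).items():
                acc[object_id] += abs(q - m) - abs(q) - abs(m)
        scores = {obj: -0.5 * a for obj, a in acc.items()}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    def candidate_features(self, object_id, node_id):
        """Model features under a tree node, used to restrict descriptor matching."""
        return self.direct.get(object_id, {}).get(node_id, [])
```

During verification, a query feature that falls in node `n` would only be compared with `candidate_features(obj, n)`, which is what keeps the number of descriptor comparisons low.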

5.3. Prior knowledge to obtain object candidates 

The first method to obtain detection candidates arises from
those objects that have been previously observed or inserted in
the map. Detecting objects that are already in the map is useful
because we can find new points that were not yet anchored to
landmarks. Inserting them helps optimize the pose of the object.
The process is described in Algorithm 1.

Input: Query image taken at position $T_{WC_i}$
Input: Set $\mathcal{O}$ of objects previously observed
Output: Set $\mathcal{B} = \{B^i_{O_1}, B^i_{O_2}, \dots\}$ of observed objects

foreach $O_k \in \mathcal{O}$ do
    Compute expected pose $T^*_{C_iO_k}$
    Project $P_O$ on image with $T^*_{C_iO_k}$
    Find new 2D-to-3D correspondences
    Estimate 3D pose to obtain observation $B^i_{O_k}$
    if pose found then
        Remove image features $U_i \in B^i_{O_k}$ from image
    end
end
return $\mathcal{B}$

Algorithm 1: Recognition of objects previously observed

We have two sources of information about observed objects:
triangulated objects inserted in the SLAM map, with optimized
poses, and non-triangulated accumulated observations. From
these, we can estimate the expected pose $T^*_{C_iO_k}$ of each object
instance in the current image if the map scale $s$ has been
estimated. If it has not, we assume that $T^*_{C_iO_k}$ is the same as the
last computed transformation $T_{C_jO_k}$ if it was obtained recently
(up to 2 seconds ago). The transformation $T^*_{C_iO_k}$ is computed
as:

$$
T^*_{C_iO_k} = \begin{cases}
T_{WC_i}^{-1}\, T_{WO_k} & \text{if } O_k \text{ is in the map,} \\
T_{WC_i}^{-1}\, T_{WC_j}\, T_{C_jO_k} & \text{if } O_k \text{ is not in the map but } s \text{ is known,} \\
T_{C_jO_k} & \text{if } s \text{ is unknown and } i - j < 2 \text{ s.}
\end{cases} \tag{8}
$$

To obtain object candidates, we first extract ORB features from the query image. For every object instance $O_k$ of which we can compute an expected pose $T^*_{C_i O_k}$, if it is visible from the current camera $C_i$, we project the object model 3D points $P_O$ on the image to look for correspondences following the same procedure explained in Section 5.5. We estimate a 3D pose from these correspondences by solving the perspective-n-point problem [10]. If it is successful, the utilized 2D features are removed from the image and an object observation $B^i_{O_k}$ is produced.
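The three cases of equation (8) reduce to a short selection routine over 4 × 4 rigid transformations. The sketch below is written under stated assumptions: poses are nested lists, camera poses are taken to be already expressed at the estimated map scale when `scale_known` is true, and the function and parameter names (`expected_pose`, `last_obs`, `max_age`) are hypothetical.

```python
def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(t):
    """Inverse of a rigid transformation [R | t]: [R^T | -R^T t]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def expected_pose(T_wc_i, t_i, T_wo=None, last_obs=None, scale_known=False, max_age=2.0):
    """Expected camera-to-object pose, or None if no prior applies.

    T_wo: object pose in the map, if the object was triangulated.
    last_obs: (T_wc_j, T_cjo, t_j), the last accumulated observation.
    """
    if T_wo is not None:                      # object already in the map
        return matmul(invert_rigid(T_wc_i), T_wo)
    if last_obs is None:
        return None
    T_wc_j, T_cjo, t_j = last_obs
    if scale_known:                           # chain through the camera poses
        return matmul(matmul(invert_rigid(T_wc_i), T_wc_j), T_cjo)
    if t_i - t_j < max_age:                   # recent observation, scale unknown
        return T_cjo
    return None


def translation(x, y, z):
    """Pure translation, used only to build the toy example."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


cam_i = translation(1.0, 0.0, 0.0)
obj_in_map = translation(1.0, 2.0, 3.0)
pose_from_map = expected_pose(cam_i, t_i=10.0, T_wo=obj_in_map)
```

With the camera displaced one unit along x, the object at (1, 2, 3) in the map is expected at (0, 2, 3) in the camera frame.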

5.4. General retrieval of object candidates 

After trying to recognize objects previously seen, the general retrieval of object candidates is performed to find new detections. This is described in Algorithm 2.

Objects can appear at any distance from the camera, so the detection of candidates should be robust against scale changes. Sliding window techniques [38], [39] are a common approach to face this difficulty by repeatedly searching image areas of varying size. In contrast, we rely on dividing the image into regions of interest to perform detections in small areas, merging results if necessary. We run the Quick Shift algorithm [40] on the ORB features of the query image to group together those that are close in the 2D coordinate space to obtain regions of interest. Quick Shift is a fast non-parametric clustering algorithm that separates N-dimensional data into an automatically chosen number of clusters. In our case, each resulting 2D cluster defines a region of interest.
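To make the grouping step concrete, here is a simplified O(n²) variant of Quick Shift on 2D keypoint coordinates. It is a sketch under assumptions rather than the implementation used in the paper: the Gaussian bandwidth `sigma`, the linking radius `tau` and their values are illustrative, and ties in density are broken by point index to keep the links acyclic.

```python
import math


def quick_shift(points, sigma=20.0, tau=60.0):
    """Return one cluster label (index of the root point) per 2D point."""
    n = len(points)
    d2 = [[(points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2
           for j in range(n)] for i in range(n)]
    # Parzen density estimate with a Gaussian window.
    density = [sum(math.exp(-d2[i][j] / (2.0 * sigma * sigma)) for j in range(n))
               for i in range(n)]
    parent = list(range(n))
    for i in range(n):
        best = tau * tau
        for j in range(n):
            # Link to the nearest point of strictly higher (density, index) within tau.
            if (density[j], j) > (density[i], i) and d2[i][j] <= best:
                best, parent[i] = d2[i][j], j

    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]


keypoints = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)]
labels = quick_shift(keypoints, sigma=2.0, tau=5.0)
```

Points that cannot reach a denser neighbour within `tau` become roots, so the number of regions follows from the data instead of being fixed beforehand.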

Input: Query image taken at position $T_{WC_i}$
Output: Set $B = \{B^i_{O_1}, B^i_{O_2}, \ldots\}$ of observed objects

$B \leftarrow \emptyset$
Divide image features into regions of interest
foreach region of interest do
    Query object database
    Compute correspondences with the top-10 candidates
end
foreach object candidate $O_k$ do
    Join correspondences from all the regions
    Verify detection and obtain observation $B^i_{O_k}$
    if detection verified then $B \leftarrow B \cup \{B^i_{O_k}\}$
end
return $B$

Algorithm 2: General object recognition

(a) Putative correspondences from the entire image   (b) Putative correspondences from regions

Figure 6: Example of putative correspondences obtained in an image with 356 features. In (a), all the features are used to query an object database of 500 models and to compute correspondences. The correct object is the 7th candidate after querying, and 24 putative correspondences are computed, of which 18 are incorrect. The object pose cannot be successfully verified after 100 random iterations trying sets of correspondences. On the other hand, in (b), regions of features are used individually to query the database and to produce putative correspondences. In this case, the correct object appears in the 1st position (out of 500) when the region that contains it is queried. In total, 9 correspondences are computed, of which just 3 are incorrect. This makes it possible to verify the object and obtain its pose after 34 iterations.

The ORB features of each region of interest are converted into a bag-of-words vector $v$ that queries the object database individually. The dissimilarity between the query vector and each of the object models $w$ in the database is measured with a score based on the Kullback-Leibler (KL) divergence [41]. This score benefits from the inverted index since its computation requires operations between words in common only, while the properties of the KL divergence are kept (cf. Appendix A):

$$
s_{KL}(v, w) = \sum_{v_i \neq 0 \,\wedge\, w_i \neq 0} v_i \log \frac{\varepsilon}{w_i}, \tag{9}
$$

where $\varepsilon$ is a positive value close to 0. In Section 6.1 we compare the performance of the KL divergence with other popular metrics. The sparsity of vectors $v$ and $w$ differs greatly, since an object model may comprise thousands of words whereas an image region contains just a dozen. This may seem a difficulty for the retrieval of the correct model; however, the bag-of-words scheme can handle this situation because it can compare vectors independently of their number of words. If a single word in a region matches a model, its tf-idf weight can already produce an object candidate with a score. Perceptual aliasing may yield wrong candidates, but the correct model is expected to produce more correct word matches, lowering its dissimilarity value. As a result, the correct object model is likely to be retrieved even if only a few words in common are found. Thus, the top-10 object models that offer the lowest dissimilarity score with vector $v$ are selected as detection candidates for each region of interest.
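The score over common words and the top-k selection can be sketched as below. The sparse vectors are dictionaries, the value of `eps` and the toy models are assumptions, and `kl_with_eps` is included only to check numerically that the reduced score differs from the ε-substituted KL divergence by a term that depends on the query alone.

```python
import math


def kl_score(v, w, eps=1e-6):
    """Score over the words present in both sparse vectors (lower is more similar)."""
    return sum(vi * math.log(eps / w[i])
               for i, vi in v.items() if vi > 0 and w.get(i, 0.0) > 0)


def kl_with_eps(v, w, eps=1e-6):
    """KL divergence where zero entries of w are replaced by eps."""
    return sum(vi * math.log(vi / (w.get(i, 0.0) or eps))
               for i, vi in v.items() if vi > 0)


def top_candidates(v, models, k=10, eps=1e-6):
    """Names of the k models with the lowest score."""
    return sorted(models, key=lambda name: kl_score(v, models[name], eps))[:k]


query = {1: 0.5, 2: 0.5}
models = {
    "bottle": {1: 0.5, 2: 0.5},   # shares both words
    "van": {1: 1.0},              # shares one word
    "card": {7: 1.0},             # shares none, score 0
}
ranking = top_candidates(query, models, k=2)
query_only_term = sum(vi * math.log(vi / 1e-6) for vi in query.values())
```

Because every shared word contributes a negative term, a model with no word in common keeps a score of 0 and is ranked last, so it never needs to be visited when the inverted index is used.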

Then, correspondences between the 2D image points and the 3D model points are obtained. This operation is sped up by using the direct index to filter out unlikely correspondences 111. Our segmentation into regions bears a resemblance to other approaches. For example, MOPED lO tries to recognize an object only with the correspondences obtained in each region, later merging the detections if their poses overlap. In contrast, we merge the correspondences of each region of interest according to their associated object model before computing any pose. This prevents missing detections due to oversegmented regions with few correspondences. We finally find the pose of the object candidates, or discard them, in the object verification stage.
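Merging correspondences by object model before any pose is computed amounts to a grouping step. The following sketch assumes that each region yields a dictionary from candidate object to a list of (2D feature, 3D model point) pairs; the identifiers used in the toy data are hypothetical.

```python
from collections import defaultdict


def merge_by_object(region_matches):
    """Join the putative correspondences of all regions, keyed by object model."""
    merged = defaultdict(list)
    for matches in region_matches:
        for object_id, pairs in matches.items():
            merged[object_id].extend(pairs)
    return dict(merged)


# Two regions that each see part of the same toy van (made-up identifiers).
region_a = {"van": [("u1", "x7"), ("u2", "x9")], "bottle": [("u3", "x1")]}
region_b = {"van": [("u8", "x2"), ("u9", "x4")]}
merged = merge_by_object([region_a, region_b])
```

Neither region alone offers more than two correspondences for the van, whereas the merged set has four, which illustrates why an oversegmented object can still reach the verification stage.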

Figure 6 shows an example of how regions of interest can help find small objects: a carton bottle is searched for in a database with 500 object models. In Figure 6(a), the query image is not divided into regions and the entire image queries the database. As a result, the background makes the correct model appear as the 7th best candidate and prevents us from obtaining correct correspondences, so the detection is missed. On the other hand, as shown by Figure 6(b), when small regions are considered, we find the correct model as the 1st candidate of its region, obtaining a better inlier ratio of correspondences (6 out of 9) and being able to verify the recognition. Thus, in addition to detecting small objects, regions of interest help to establish better point correspondences. Figure 7 shows an example in which a toy van is divided into two regions. Since region correspondences are finally merged, it can be correctly recognized.

Figure 7: Example of a real object that lies in two different image regions. Since putative correspondences from different regions are merged, it is correctly found with 9 inliers.

5.5. Object verification and pose estimation 

After obtaining putative 2D-to-3D correspondences between the query image and the object candidates, we try to verify and find the pose of each object by iteratively selecting random subsets of correspondences and solving the perspective-n-point problem (PnP) [10]. Plenty of algorithms based on random sample consensus (RANSAC) [10] exist to achieve this. For example, progressive sample consensus (PROSAC) [42] arranges the correspondences according to their distance in the descriptor space. Then, ordered permutations of low distance are selected as subsets for a parametrized number of tries, after which the algorithm falls back to RANSAC. This is usually much faster than RANSAC when the pose can be found. However, in the presence of mismatching correspondences with low descriptor distance (e.g. due to perceptual aliasing), PROSAC may spend several tries on subsets with outliers. Since we set a low number of maximum iterations (50) to limit the impact on the execution time, we propose a variation of PROSAC that eases the rigidity of the fixed permutations and is thus more flexible when there are low-distance mismatches. We call it distance sample consensus (DISAC); it consists in drawing correspondences $c_j$ from the set of correspondences $C = \{c_1, \ldots, c_n\}$ with a probability inversely proportional to their Hamming distance $h$:

$$
P(c_j) = \frac{1}{h(c_j) \sum_{k=1}^{n} \frac{1}{h(c_k)}}. \tag{10}
$$

To avoid numerical inconsistencies, we set $h(c_j) = 1$ when the distance is exactly 0. Now, in the case of outliers, there is a non-zero probability of avoiding them even in the first iterations of DISAC.

(a) Some randomly selected putative correspondences yield a first pose.
(b) A final pose is computed from the new correspondences obtained after projecting the object model.

Figure 8: Computation of a robust candidate pose in a DISAC iteration.
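The sampling rule of equation (10) can be sketched with the standard library alone. Two points are assumptions of this sketch rather than statements of the paper: subsets are drawn without replacement by renormalizing over the remaining correspondences, and the helper names `disac_probabilities` and `draw_subset` are hypothetical.

```python
import random


def disac_probabilities(hamming_distances):
    """P(c_j) proportional to 1 / h(c_j), with h = 1 when the distance is 0."""
    inverse = [1.0 / max(h, 1) for h in hamming_distances]
    total = sum(inverse)
    return [x / total for x in inverse]


def draw_subset(correspondences, hamming_distances, size, rng):
    """Draw `size` distinct correspondences, favouring low descriptor distances."""
    probabilities = disac_probabilities(hamming_distances)
    pool = list(range(len(correspondences)))
    chosen = []
    while pool and len(chosen) < size:
        weights = [probabilities[i] for i in pool]
        index = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(index)
        chosen.append(correspondences[index])
    return chosen


distances = [0, 1, 2, 4]
probabilities = disac_probabilities(distances)
subset = draw_subset(["c1", "c2", "c3", "c4"], distances, size=3, rng=random.Random(0))
```

Unlike a fixed ordering by distance, every correspondence keeps a non-zero chance of being picked in any iteration, so a low-distance mismatch does not have to appear in the first subsets tried.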

If a 3D pose is found with the selected subset of correspondences, we try to refine it by selecting additional correspondences that were not given by the direct index. For that, we project the object model 3D points $P_O$ on the query image, obtaining a set of 2D points. For each visible point $x_O$ we search a 7 × 7 patch centered at its projection for a matching ORB image feature $u$. We consider them to match if any of the ORB descriptors associated to the corresponding 3D point is at a Hamming distance lower than 50 units, which assures a low ratio of mismatches m. If new correspondences are found, we compute a new refined pose $\hat{T}_{C_i O}$, as shown by Figure 8. We measure the quality of a pose $\hat{T}_{C_i O}$ with a reformulation of Torr & Zisserman's M-estimator [43] on the reprojection error that also takes into account the number of inlier correspondences:

$$
s_{DISAC} = \sum_{\langle u, x_O \rangle} \max\left(0,\; \mu_e - \left\| u - \mathrm{CamProj}(\hat{T}_{C_i O}\, x_O) \right\|\right), \tag{11}
$$

where CamProj is the projection function presented in Section 4 and $\mu_e$ is a threshold on the reprojection error set to 3 px.

We keep the refined transformation of maximum $s_{DISAC}$ score of all the DISAC samples for each object candidate, if any, verifying the recognition and finding the transformation $T_{C_i O_k}$ between the current camera and the object. This, together with the collection of 2D-to-3D correspondences, composes the object observation $B^i_{O_k} = \langle T_{WC_i}, T_{C_i O_k}, U_i \rangle$ that feeds the SLAM algorithm.

Figure 9: Testing objects of the desktop dataset

Figure 10: Example of Nister & Stewenius's [25] objects
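The refinement search and the pose score can be illustrated as follows. This is a sketch under several assumptions: the score is taken to be the sum of max(0, μ_e − reprojection error) over the correspondences, CamProj is replaced by a plain pinhole model with hypothetical intrinsics `fx, fy, cx, cy`, descriptors are integers, and the closest descriptor inside the patch is kept although the text only requires a distance below 50.

```python
import math


def hamming(a, b):
    return bin(a ^ b).count("1")


def cam_proj(T, X, fx, fy, cx, cy):
    """Pinhole projection of 3D point X after applying the 4x4 pose T (assumed model)."""
    xc, yc, zc = (sum(T[r][c] * X[c] for c in range(3)) + T[r][3] for r in range(3))
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def find_match(projection, model_descriptors, features, half=3, max_dist=50):
    """Index of the best feature inside the 7x7 patch around the projection, or None.

    features: list of ((u, v), descriptor) for the ORB features of the image.
    """
    best, best_dist = None, max_dist
    for index, ((u, v), descriptor) in enumerate(features):
        if abs(u - projection[0]) <= half and abs(v - projection[1]) <= half:
            dist = min(hamming(descriptor, m) for m in model_descriptors)
            if dist < best_dist:
                best, best_dist = index, dist
    return best


def disac_score(T, matches, intrinsics, mu_e=3.0):
    """Sum of max(0, mu_e - reprojection error) over (u, x_O) correspondences."""
    score = 0.0
    for u, x_o in matches:
        p = cam_proj(T, x_o, *intrinsics)
        score += max(0.0, mu_e - math.hypot(u[0] - p[0], u[1] - p[1]))
    return score


identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)
matches = [((320.0, 240.0), (0.0, 0.0, 1.0)),   # exact: contributes 3
           ((321.0, 240.0), (0.0, 0.0, 1.0)),   # 1 px off: contributes 2
           ((330.0, 240.0), (0.0, 0.0, 1.0))]   # 10 px off: contributes 0
score = disac_score(identity, matches, intrinsics)
```

A score of this form grows both with the number of inliers and with their accuracy, which is why the maximum over the DISAC samples is kept.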


6. Experimental evaluation 

Our system has been implemented in C++, as modules of the Robot Operating System (ROS) [44], exploiting parallelization with OpenMP [45] in the object candidate detection and verification steps. All the tests were done on an Intel Core i7 @ 2.67GHz PC.

We evaluate our system on five different datasets with sets of 7 to 500 objects: the Desktop dataset, used for testing purposes; one of the sequences of the RGB-D SLAM Dataset [46], which provides ground truth of the camera pose; the Aroa's room dataset, a child's real room with dozens of different objects; the Snack dataset, a sequence that shows several instances of the same object models and forces camera relocation; and the Snack with clutter dataset, a small area with repeated objects, occlusion and background clutter.

6.1. Desktop dataset 

The desktop dataset is a 6'26'' sequence of 640 × 480 images collected with a Unibrain camera on a desktop area, which we used to test our object recognition algorithm. The dataset shows the 6 objects illustrated in Figure 9, whose largest dimension is between 10 and 20 cm. These were modeled with consumer photo cameras. In addition to these objects, we created models from the image dataset provided by Nister & Stewenius [25]. These are sets of 4 images depicting general objects, such as those shown in Figure 10, under different points of view and illumination conditions. We used up to 494 sets of images to populate our object databases with models to be used as distractors for the object candidate detection step.

We show first the results of the object candidate retrieval step from single images.

Figure 11: Performance of several similarity scores to retrieve candidates when (a) a database with 500 objects is queried, and (b) the top-10 candidates are retrieved from databases of different sizes.

Figure 12: Example of correct detections in the desktop dataset, with 500 objects in the database.

Figure 13: Resulting map of the desktop dataset with 500 models in the database. All the objects are correctly located in the space with no false positives.

In addition to the KL divergence, we evaluated the performance of other similarity metrics popular in bag-of-words approaches [47] to detect object models. We selected a set of 300 images of 640 × 480 pixels that show one object at a time from a distance of between 20 and 70 cm. Then, we manually masked out the background and queried the database varying the number of stored objects, computing the KL divergence, the Bhattacharyya coefficient, the χ² distance and the L1-norm and L2-norm distances. Figure 11 shows the retrieval performance of each metric, defined as the percentage of correct object candidates returned. Figure 11(a) shows the performance against a database of 500 models regarding the number of top results that we consider candidates. The KL divergence offers a higher performance than the rest of the metrics. We can see that the performance increases remarkably when we consider as candidates up to the top-10 results, where the performance stalls. Since increasing the number of candidates increases the execution time of the verification step, we choose to select the top-10 object candidates for each region of interest. Figure 11(b) shows the evolution of the metric scores when the top-10 candidates are selected from databases comprising from 10 to 500 objects. We can see that the KL divergence always provides the best performance.

When running our complete Object SLAM approach in the desktop dataset, the 6 objects were correctly located in the space, with no false positive detections. Figure 12 shows some correct object detections in single images. Figure 13 shows the obtained map, including objects, keyframes (gray cameras), semantic keyframes (red cameras) and points.

Figure 14: Example of robustness against inaccurate detections. (a) Prior knowledge about the location of the objects (blue outline) is used to recognize objects. Some features (red dots) of the card are correctly matched with the model, but these are ill-distributed and the pose calculated in this single frame is inaccurate. (b) In the next frame, the actual pose of the card remains correct because it is computed from all the accumulated observations. This makes it possible to accurately detect the card again.

In some cases, the pose of an object obtained from a single-image detection may be inaccurate. Any algorithm is subject to this due to several factors, such as perceptual aliasing or bad geometrical conditioning. Since we do not rely on a single detection to locate an object in space, our system is able to overcome detection inaccuracies. For example, consider the case shown by Figure 14(a). We have two prior locations of the chewing gum box and a card (blue outline), from which the two poses of the objects are obtained (red outline). However, the observed points of the card (red dots) are not widely distributed, causing the recovered pose to be ill conditioned. This results in an inaccurate object pose although the 2D-to-3D correspondences are correct. Since we had additional geometrical information accumulated from previous observations of the card, its actually computed pose remained correct, as can be seen in the next detection, shown by Figure 14(b), where the recognition of the card is fully accurate.

Figure 15: System execution time in the Desktop dataset

The sequence is processed in real time. Figure 15(a) shows the execution time of the SLAM tracking (block averaged for readability), which takes 3.3 ms on average. Figure 15(b) shows the execution time taken by the object recognition process with each image, being 138 ms per image on average. Since the tracking and the recognition run in parallel, the SLAM map is created successfully in real time independently of the time taken by the object recognition. It is worth mentioning that the total execution time of our system, which performs object recognition and SLAM, is lower than the time consumption of other approaches that run object recognition only, as we show in Table 1. MOPED 0 is a state-of-the-art algorithm, highly optimized to use thread parallelization, that recognizes 3D objects from SIFT features and retrieves their pose in space from single images. To obtain its results, we ran MOPED in the desktop dataset, on the same computer and with the same 500 models. Our object recognition algorithm yields similar results to those by MOPED, as shown in Table 2 (Our system, no priors). Furthermore, this table shows that our full system is able to provide more detections of objects when we exploit the prior locations obtained by the SLAM optimization over time. The advantage of the prior information is well illustrated with the lion toy. It is a challenging object to recognize because its repetitive red and white stripes are prone to cause mismatches. Our algorithm triangulated its initial pose from 4 observations, setting a prior location that enabled subsequent successful detections. This highlights the fact that any object recognition algorithm can be enhanced by the SLAM approach we propose.

                                           Median   Max.
Our system (Object recognition + SLAM)      0.14    0.34
MOPED 111 (Object recognition)              0.52    0.95

Table 1: Execution time of the object recognition stage of our system compared with MOPED, with a database of 500 objects (s/image).

                   Our system   Our system, no priors   MOPED 0
Bottle                137               41                 81
Van toy               104               33                  8
Box                    91               56                 63
Lion toy               80                5                  0
Card 1                258              227                166
Card 2                200              118                121
Total detections      870              480                439

Table 2: Number of detections in the desktop dataset with a database of 500 objects. By exploiting the priors given by the SLAM map we can provide more detections, even compared with MOPED, a state-of-the-art single-image recognition algorithm.

Figure 16: Objects of the RGB-D SLAM Dataset

6.2. RGB-D SLAM Dataset 

Sturm et al. [46] acquired several video sequences with an RGB-D Kinect camera to evaluate RGB-D SLAM systems. These are conveniently provided with the ground truth trajectory of the camera, obtained from a high-accuracy motion-capture system. We utilized this dataset to measure the accuracy of the camera location estimated by our system with monocular images.

We made use of several of their sequences: one to evaluate our Object SLAM, and the others to train the models of the objects that appear in the former. We ran our system on the sequence titled freiburg3_nostructure_texture_near_withloop, in which the camera describes a loop, moving around some posters lying on the floor (shown by Figure 16). We built the object models with the validation version of the previous sequence and with the sequence nostructure_texture_far, which shows the same posters but from different camera positions and distances. Thus, we do not use the same data for training and evaluation. We created the object models by taking sparse RGB images of each poster (~20) and processing them as explained in Section 5.1. We set their scale by reconstructing their 3D point clouds and measuring the real distance between pairs of points. As in the desktop dataset, besides these 8 models, we filled the database up to 500 models.

Figure 17: Map of objects created in the RGB-D SLAM sequence

                    Translation (cm)   Rotation (deg)   RMSE (cm)
Our system             3.4 ± 2.5         1.4 ± 0.7         4.2
RGB-D SLAM [12]        9.6 ± 5.7         3.9 ± 0.6        11.2
PTAM [11]              5.0 ± 2.4         2.1 ± 0.9         5.6

Table 3: Absolute Trajectory Error. Translation and rotation mean error (mean ± std) and translation RMSE of our system in comparison with RGB-D SLAM and PTAM.

                    Translation (cm)   Rotation (deg)   RMSE (cm)
Our system             5.1 ± 3.3         1.5 ± 0.9         6.1
RGB-D SLAM [12]       17.7 ± 10.5        4.9 ± 2.3        20.6
PTAM [11]              6.0 ± 3.1         1.1 ± 0.5         6.8

Table 4: Relative Pose Error. Translation and rotation mean error (mean ± std) and translation RMSE of our system in comparison with RGB-D SLAM and PTAM.

Our system is able to recognize and place all the posters in the scene but two: the smallest one, for which no detections are obtained due to its size, and the one in the middle of the scene, because the camera moves very close to the floor and barely points at the center of the trajectory.

The produced map is shown in Figure 17. Its scale is successfully estimated from the triangulations of the objects. The average error obtained in the poses of the keyframes (the only poses that are optimized by the BA) is 3.4 cm in translation and 1.4 degrees in rotation.

We compared our system with the RGB-D SLAM algorithm [12] and PTAM [11]. RGB-D SLAM creates a graph with the poses of the camera, linking the nodes with the relative transformation between them. These are obtained by computing 3D-to-3D correspondences from SURF features between pairs of images, by using the RGB and depth data of a Kinect camera. The graph is then optimized with g2o ED. Since their sensor provides depth, their map is at real scale.

A qualitative comparison of the trajectories estimated by our system, RGB-D SLAM and PTAM is shown in Figure 18, along with the ground truth. We obtained this figure by aligning, according to the timestamps of the images, the evaluated trajectories with the ground truth by means of Horn's method [48]. Since PTAM is a monocular system, we computed the scale by aligning the first meter of the trajectory in 7DoF.

Figure 18: Trajectory and scale estimated by our Object SLAM system in comparison with RGB-D SLAM [12] and PTAM [11] in the RGB-D SLAM Dataset.

Figure 19: System execution time in the RGB-D SLAM dataset

For a quantitative comparison we ran the sequence on each system 10 times and report the average Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) [46]. ATE compares the absolute distances between the estimated and the ground truth trajectories after alignment; the results are in Table 3. For the rotation error, we computed the circular mean and standard deviation of the angles between the orientation of each pose and its corresponding one in the ground truth. We also show the root-mean-square error (RMSE) on translation. We report the relative pose error in Table 4, which measures the local accuracy of the trajectory over a fixed time interval and corresponds to the drift of the trajectory. Instead of restricting the evaluation to a fixed time interval, we compute the average over all possible time intervals [46]. The average ATE yielded by RGB-D SLAM is 9.6 cm in translation and 3.9 degrees in rotation. We observed that this system creates a bias in the scale of the trajectories in some datasets, which produces this error. However, the origin of this bias is not clear. We conclude that by introducing objects, our monocular system can retrieve the real scale of the scene and create maps that are more accurate and contain richer information (3D objects and points).
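The translation statistics reported in the tables are related: for a population standard deviation, RMSE² = mean² + std². The sketch below computes the three quantities for already aligned trajectories and checks that relation against the rounded table values; the function name and the toy trajectories are assumptions, and since the reported figures are rounded averages over 10 runs, only approximate agreement can be expected.

```python
import math


def translation_error_stats(estimated, ground_truth):
    """Mean, population std and RMSE of the point-to-point translation errors."""
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((x - mean) ** 2 for x in errors) / len(errors))
    rmse = math.sqrt(sum(x * x for x in errors) / len(errors))
    return mean, std, rmse


# Toy aligned trajectories with errors of 3 and 4 units.
mean, std, rmse = translation_error_stats(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)],
)

# (mean, std, reported RMSE) in cm, copied from the ATE and RPE tables.
reported = [(3.4, 2.5, 4.2), (9.6, 5.7, 11.2), (5.0, 2.4, 5.6),
            (5.1, 3.3, 6.1), (17.7, 10.5, 20.6), (6.0, 3.1, 6.8)]
implied = [math.hypot(m, s) for m, s, _ in reported]
```

Every implied value lies within 0.1 cm of the reported RMSE, which also supports reading the table entries as mean ± std.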

Figure 19(a) shows the execution time of SLAM tracking, which takes 7.6 ms on average. The execution time taken by the object recognition process is 220 ms per image on average with a database of 500 objects, and it is shown in Figure 19(b).


6.3. Aroa's Room and Snack datasets

We collected three more sequences to show qualitative results of our system in challenging scenarios.

The Aroa's room dataset was collected with a Kinect camera (using the RGB sensor only) in a child's real room, where we modeled a set of 13 objects (toys and pieces of furniture, such as blinds and a wall poster) of diverse size, with a consumer photo camera. Figure 20 shows the environment and some of the objects. The main challenge of this scenario is the highly textured clutter that can produce mismatches in the object recognition. Figure 21 shows the resulting map, where all the objects are located. The full execution can be watched on video (http://youtu.be/cR_tkKpDZuo).

The Snack dataset shows a sequence recorded with a Unibrain camera, where 10 bottles and cans, some of them identical, are placed together on a table. The database is filled with 21 models of snacks. Figure 22 shows the 5 models that actually appear on the table. In this sequence, we intentionally made SLAM lose tracking of the camera on two occasions. The observations of the objects are still merged once the camera is relocated, obtaining successful triangulations. Figure 23 shows the system running. The two windows on top show the PTAM tracking and the 3D map created so far. Below, on the left, are the current object detections (red outline) and the prior knowledge about their position (blue outline). Figure 24 shows the objects in the final map. The full sequence can be watched on video (http://youtu.be/C3z62h6NPt4).

Figure 23: Object SLAM running in the Snack dataset

Figure 24: Map of objects created in the Snack dataset

The Snack with clutter dataset shows a sequence with 6 bottles and cans, some of them identical, in the highly cluttered and textured scenario depicted by Figure 25. The main challenge of this scenario for the object recognition is that objects are placed very close to each other and with remarkable occlusion, so that regions of features may not separate objects accurately. In spite of that, the object detector yields successful results in single frames, such as those depicted by Figure 26. In Figures 26(a) and 26(b), both the bottle and the can are found before any prior information is available, even though only a small part of them is visible. Although not all the objects present in those two frames are recognized, those detections create prior information that is exploited in the next frames, making it possible to detect all the objects, as shown by Figure 26(c). This exhibits the ability of our system to exploit the information provided by a sequence of images instead of working on a single-image basis. Finally, our system produces the map shown by Figure 27. The full sequence can be watched on video (http://youtu.be/u8gvKahWtlQ).

Figure 25: Snack with clutter scenario

(a) An occluded bottle is detected without prior information
(b) An occluded can is detected without prior information
(c) All objects are detected with prior information

Figure 26: Successful detections in the Snack with clutter scenario

Figure 27: Map of objects created in the Snack with clutter dataset

These three sequences show that our system can create consistent 3D maps of points and objects, handling very different objects at the same time and dealing with several instances of the same models, in highly cluttered and occluded scenes and even in cases in which track is lost.

Our system provides safety checks at different stages to keep the map consistent. The erroneous observations are due to spurious detections. These rarely occur because of the feature match constraints imposed in the object recognition stage. If they happen, or the computed pose is not very accurate, the observation accumulation stage prevents the wrong detections from damaging the map because several consistent observations with wide parallax are very unlikely. Figure 28 illustrates this case. The blue lines outline the pose provided by the prior information. There is a prior around each object and two additional wrong ones around the blue bottles. These were created by inaccurate object detections. Since these observations did not match the correct detections, they were accumulated as new instances of the bottle. However, they are not triangulated because they are not supported by other observations. In the rare case that a wrong instance was triangulated and anchor points created, the Huber robust influence function would decrease its impact in the optimization stage when there were enough correct anchor points in the map, keeping the map consistent.

Figure 28: Wrong priors are created around the blue bottles by inaccurate detections, but they are not triangulated and the map remains correct.


7. Conclusions 

We have presented an object-aware monocular SLAM system that includes a novel and efficient 3D object recognition algorithm for a database of up to 500 3D object models. On the one hand, we have shown how embedding the single-frame bag-of-words recognition method in the SLAM pipeline can boost the recognition performance in datasets with dozens of different objects, repeated instances, occlusion and clutter. We believe that this benefit is not only achievable by this technique but by any other recognition method embedded within the SLAM pipeline that can exploit the accumulated observations of objects.

On the other hand, the inclusion of objects adds to the SLAM map a collection of anchor points that provides geometrical constraints in the back-end optimization and enables the estimation of the real map scale. We have shown that our system can yield more accurate maps than other state-of-the-art algorithms that use RGB-D data.

There is a case we have not addressed in this work: when the first object inserted in the map originates from wrong observations. This would cause a first incorrect scale estimate and lead to a missized map. This may be tackled by inspecting the variance of the scale estimates given by each object triangulation, so that any observation with an inconsistent scale could be eliminated. Alternatively, the problem might also be avoided if an initial rough scale estimate is available; for example, from the odometry or IMU sensors with which robots and mobile devices are usually equipped. Nevertheless, because of the safety steps of our approach, this case can rarely occur and it did not happen in our experiments.

Including objects in maps paves the way to augment them with semantic data, providing enriched information to a user, or additional knowledge about an environment to an operating robot [49]. We can use this knowledge in future work to reason about the mobility of objects, making it possible to allow object frames to move in the 3D space, creating dynamic maps.


Appendix A. Efficient KL-divergence computation 


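Let $v$ and $w$ be two vectors such that $\|v\|_1 = \|w\|_1 = 1$, $\mathcal{V} = \{i \mid v_i \neq 0\}$, $\bar{\mathcal{V}} = \{i \mid v_i = 0\}$, and analogously for $\mathcal{W}$ and $\bar{\mathcal{W}}$. The Kullback-Leibler divergence is defined as

$$
KL(v, w) = \sum_{\mathcal{V}} v_i \log \frac{v_i}{w_i}. \tag{A.1}
$$

To avoid undetermined values, we substitute $w_i$ by a constant value $\varepsilon \to 0^+$ when $w_i = 0$, so we can rewrite (A.1) as

$$
KL(v, w) = \sum_{\mathcal{V} \cap \mathcal{W}} v_i \log \frac{v_i}{w_i} + \sum_{\mathcal{V} \cap \bar{\mathcal{W}}} v_i \log \frac{v_i}{\varepsilon} \tag{A.2}
$$

$$
= \sum_{\mathcal{V} \cap \mathcal{W}} v_i \log \frac{v_i}{w_i} + \sum_{\mathcal{V}} v_i \log \frac{v_i}{\varepsilon} - \sum_{\mathcal{V} \cap \mathcal{W}} v_i \log \frac{v_i}{\varepsilon} \tag{A.3}
$$

$$
= \sum_{\mathcal{V}} v_i \log \frac{v_i}{\varepsilon} + \sum_{\mathcal{V} \cap \mathcal{W}} v_i \log \frac{\varepsilon}{w_i}. \tag{A.4}
$$

Since one of the addends depends only on vector $v$, we remove it when we want to compare the divergence between $v$ (a query vector) and other vectors $w$ (object models). Therefore, our score results in

$$
s_{KL}(v, w) = \sum_{\mathcal{V} \cap \mathcal{W}} v_i \log \frac{\varepsilon}{w_i}. \tag{A.5}
$$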